Phosphorylated ribosome-IP captured mRNA from mouse olfactory epithelium. One group of mice smelled acetophenone for an hour, the other group did not. Because ribosome phosphorylation only occurs in activated neurons, we expect to find odorant receptors responding to acetophenone by looking for over-represented mRNA transcripts in stimulated samples.
(For details, see http://www.nature.com/neuro/journal/v18/n10/abs/nn.4104.html)
Let’s set up the environment where we work. Let’s assume that you have a folder on your desktop called achems, and we will use that folder as our working directory. We will download our input file (the count table) to that folder.
If a library hasn’t been installed yet, you can install it using e.g. install.packages("ggplot2"). For bioconductor libraries, use biocLite instead, e.g.
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
setwd("~/Desktop/achems/")
system("curl -O https://raw.githubusercontent.com/Yue-Jiang/achems/master/count_table.tab")
library(ggplot2)
library(ggrepel)
library(dplyr)
library(tidyr)
library(edgeR)
library(grid)
library(gridExtra)
library(knitr)
library(superheat)
library(plotly)
library(clusterProfiler)
library(org.Mm.eg.db)
# just a theme for plotting! don't worry about it!
my_theme <- function(base_size=14) {
(theme_bw(base_size = base_size)
+ theme(plot.title = element_text(face = "bold",
size = rel(1.2), hjust = 0.5),
text = element_text(),
panel.background = element_rect(colour = NA),
plot.background = element_rect(colour = NA),
panel.border = element_rect(colour = NA),
axis.title = element_text(face = "bold",size = rel(1)),
axis.title.y = element_text(angle=90,vjust =2),
axis.title.x = element_text(vjust = -0.2),
axis.text = element_text(),
axis.ticks = element_line(),
panel.grid.major = element_line(colour="#f0f0f0"),
panel.grid.minor = element_blank(),
legend.key = element_rect(colour = NA)
)
)
}
Differential expression analysis softwares usually ask that you have a table of your expression data. Each software may have slightly different requirements in the way they want their input data prepared. Usually we start with a data frame with expression levels in counts (integer).
count_df <- read.table("count_table.tab", sep="\t", header=TRUE, stringsAsFactors=FALSE)
This is what the count table looks like (showing top 10 rows, 23481 rows total).
| Gene | Ctrl_A | Ctrl_B | Ctrl_C | Stim_A | Stim_B | Stim_C |
|---|---|---|---|---|---|---|
| 0610005C13Rik | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610007N19Rik | 82 | 59 | 21 | 36 | 32 | 15 |
| 0610007P14Rik | 58 | 91 | 68 | 62 | 74 | 58 |
| 0610008F07Rik | 0 | 0 | 0 | 0 | 0 | 0 |
| 0610009B14Rik | 2 | 0 | 1 | 0 | 0 | 0 |
| 0610009B22Rik | 271 | 577 | 328 | 253 | 444 | 325 |
| 0610009D07Rik | 270 | 654 | 372 | 260 | 474 | 430 |
| 0610009L18Rik | 32 | 22 | 14 | 18 | 24 | 16 |
| 0610009O20Rik | 212 | 122 | 74 | 145 | 139 | 97 |
| 0610010B08Rik | 1 | 8 | 5 | 1 | 11 | 12 |
EdgeR asks that your input is a matrix.
countData <- as.matrix(count_df[, -1])
rownames(countData) <- count_df$Gene
# filter out genes lowly expressed, this is an arbitrary cutoff
counts <- countData[ rowSums(countData) > 5, ]
y <- DGEList(counts=counts)
Design table specifies experiment conditions for your samples. In this experiment, we have 6 samples with treatment (control vs stimulated) and littermate information (A, B, C). We are interested in genes differentially expressed in stimulated samples as compared to the control.
targets <- data.frame("Sample"=c("Ctrl_A", "Ctrl_B", "Ctrl_C", "Stim_A", "Stim_B", "Stim_C"),
"Treatment"=c("control", "control", "control", "stimulated", "stimulated", "stimulated"),
"Litter"=c("A", "B", "C", "A", "B", "C"))
Litter <- factor(targets$Litter)
Treat <- factor(targets$Treatment, levels=c("control","stimulated"))
The targets table (the table you made, more human readable) looks like this.
| Sample | Treatment | Litter |
|---|---|---|
| Ctrl_A | control | A |
| Ctrl_B | control | B |
| Ctrl_C | control | C |
| Stim_A | stimulated | A |
| Stim_B | stimulated | B |
| Stim_C | stimulated | C |
Let’s perform differential expression analysis using edgeR. Remember that we are only interested in the effect of treatment. The litter or batch effect should be included in the model as a random effect to absorb the variance explained by batch so that we get better sensitivity. If we didn’t have litter matched experiment design, we can only compare control and treatment as two homogeneous groups (which is also the most common use case, so let’s do it anyways).
design_nolitter <- model.matrix(~Treat)
The design looks like this.
| (Intercept) | Treatstimulated |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
y_nolitter <- estimateDisp(y, design_nolitter)
fit_nolitter <- glmFit(y_nolitter, design_nolitter)
lrt_nolitter <- glmLRT(fit_nolitter)
de_nolitter <- topTags(lrt_nolitter, n=dim(lrt_nolitter)[1])$table
Without using litter (batch) information, we only recover 6 DE genes (FDR < 0.05).
kable(de_nolitter[de_nolitter$FDR < 0.05, ])
| logFC | logCPM | LR | PValue | FDR | |
|---|---|---|---|---|---|
| Tmem19 | -6.213243 | 3.498855 | 45.15677 | 0.00e+00 | 0.0000002 |
| Tmem190 | 6.203127 | 3.387871 | 45.02833 | 0.00e+00 | 0.0000002 |
| Olfr937 | 4.999412 | 3.561300 | 24.87306 | 6.00e-07 | 0.0035015 |
| Olfr923 | 3.499782 | 5.023108 | 21.31618 | 3.90e-06 | 0.0167017 |
| Olfr19 | 4.395690 | 4.797225 | 19.18782 | 1.18e-05 | 0.0399375 |
| Olfr1047 | 4.155271 | 3.624147 | 18.87346 | 1.40e-05 | 0.0399375 |
Now let’s do it properly, by explicitly accounting for litter effects.
design <- model.matrix(~Litter + Treat)
Now the design looks like this.
| (Intercept) | LitterB | LitterC | Treatstimulated |
|---|---|---|---|
| 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 |
y <- estimateGLMCommonDisp(y,design)
y <- estimateGLMTrendedDisp(y,design)
y <- estimateGLMTagwiseDisp(y,design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit)
de <- topTags(lrt,n=dim(lrt)[1])$table
Now we get better sensitivity!
# number of DE genes now
sum(de$FDR < 0.05)
## [1] 80
Here are the top 10 most DE genes.
kable(de[order(de$FDR)[1:10], ])
| logFC | logCPM | LR | PValue | FDR | |
|---|---|---|---|---|---|
| Olfr745 | 3.259074 | 6.488694 | 129.73623 | 0 | 0 |
| Olfr376 | 3.294949 | 5.767395 | 113.52111 | 0 | 0 |
| Ranbp10 | 6.446126 | 4.140624 | 84.22813 | 0 | 0 |
| Olfr19 | 4.313269 | 4.793146 | 82.85546 | 0 | 0 |
| Olfr749 | 2.752854 | 6.484576 | 77.86890 | 0 | 0 |
| Tmem19 | -6.199616 | 3.500925 | 74.81520 | 0 | 0 |
| Tmem190 | 6.179645 | 3.379617 | 71.89756 | 0 | 0 |
| Pcdh10 | 2.203870 | 6.019768 | 70.09725 | 0 | 0 |
| Olfr30 | 2.634877 | 4.977079 | 66.96668 | 0 | 0 |
| Olfr937 | 4.954899 | 3.554047 | 62.04721 | 0 | 0 |
For clarity, let’s make a data frame to hold the data for plotting. The data frame will contain the following columns:
de_df <- data.frame("Gene"=rownames(de),
"log2CPM"=de$logCPM,
"log2FoldChange"=de$logFC,
"FDR"=de$FDR) %>%
# add an annotation column indicating whether this gene is an OR
mutate(Annotation=case_when(grepl("^Olfr", .$Gene) ~ "Odorant receptor",
TRUE ~ "Other genes")) %>%
# add another annotation column indicating whether gene is enriched / decreased / not DE
mutate(DE=case_when(.$FDR <= 0.05 & .$log2FoldChange > 0 ~ "Enriched (p<=0.05)",
.$FDR <= 0.05 & .$log2FoldChange < 0 ~ "Decreased (p<=0.05)",
TRUE ~ "p>0.05"))
For the purpose of this study, we are primarily interested in differentially expressed ORs only. As such, we should restrict the FDR correction to OR genes only, otherwise we are losing some sensitivity.
or_de_df <- data.frame("Gene"=rownames(de),
"log2CPM"=de$logCPM,
"log2FoldChange"=de$logFC,
"pvalue"=de$PValue) %>%
# only keep ORs (genes starting with Olfr)
filter(grepl("^Olfr", Gene)) %>%
# fill N/A adjusted p-values with 1's
mutate(FDR=p.adjust(pvalue, method="BH")) %>%
# add another annotation column indicating whether gene is enriched / decreased / not DE
mutate(DE=case_when(.$FDR <= 0.05 & .$log2FoldChange > 0 ~ "Enriched (p<=0.05)",
TRUE ~ "Others"))
An MA plot is plotting the log2 fold change vs mean expression level. We can customize the MA plot by highlighting genes of interest (for example, all odorant receptor genes).
p1 <- ggplot(de_df %>% arrange(desc(FDR)), aes(log2CPM, log2FoldChange, color=DE)) +
geom_point(alpha=0.8, size=1) +
scale_color_manual(values=c("forestgreen", "red", "grey80")) +
my_theme()
p2 <- ggplot(de_df %>% arrange(desc(Annotation)), aes(log2CPM, log2FoldChange, color=Annotation)) +
geom_point(alpha=0.8, size=1) +
scale_color_manual(values=c("red", "grey80")) +
my_theme()
grid.arrange(p1, p2, ncol=1)
Another popular way to visualize DE analysis is to plot log scaled p-values against log scaled fold change. We can customize the volcano plot by labeling the ORs that are enriched in stimulated samples.
ggplot(or_de_df, aes(log2FoldChange, log10(FDR), color=DE)) +
geom_point(alpha=0.8, size=1) +
geom_text_repel(data=or_de_df %>% filter(DE == "Enriched (p<=0.05)"), aes(label=Gene), segment.colour="grey50", color="grey10") +
scale_y_reverse() +
scale_color_manual(values=c("red", "grey80")) +
my_theme()
Well, that’s probably not that good an idea. Let’s label M72 (Olfr160) only.
ggplot(or_de_df, aes(log2FoldChange, log10(FDR), color=DE)) +
geom_point(alpha=0.8, size=1) +
geom_text_repel(data=or_de_df %>% filter(Gene == "Olfr160"),
label="M72",
min.segment.length=unit(0, 'lines'),
nudge_x=3,
segment.colour="grey50",
color="grey10") +
scale_y_reverse() +
scale_color_manual(values=c("red", "grey80")) +
my_theme()
Are certain pathways enriched in the DE genes?
de_genes <- as.character(filter(de_df, FDR <= 0.05)$Gene)
background_genes <- as.character(filter(de_df, logCPM >= log2(10))$Gene)
go <- enrichGO(gene=de_genes,
universe=background_genes,
OrgDb = org.Mm.eg.db, keytype = 'SYMBOL',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.1,
qvalueCutoff = 0.1
)
kable(go@result)
| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
|---|---|---|---|---|---|---|---|---|---|
| GO:0007608 | GO:0007608 | sensory perception of smell | 29/76 | 160/8158 | 0.0000000 | 0.0000000 | 0.0000000 | Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 | 29 |
| GO:0007606 | GO:0007606 | sensory perception of chemical stimulus | 29/76 | 175/8158 | 0.0000000 | 0.0000000 | 0.0000000 | Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 | 29 |
| GO:0007186 | GO:0007186 | G-protein coupled receptor signaling pathway | 29/76 | 272/8158 | 0.0000000 | 0.0000000 | 0.0000000 | Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 | 29 |
| GO:0007600 | GO:0007600 | sensory perception | 29/76 | 279/8158 | 0.0000000 | 0.0000000 | 0.0000000 | Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 | 29 |
| GO:0050877 | GO:0050877 | neurological system process | 29/76 | 407/8158 | 0.0000000 | 0.0000000 | 0.0000000 | Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 | 29 |
| GO:0050907 | GO:0050907 | detection of chemical stimulus involved in sensory perception | 4/76 | 38/8158 | 0.0004039 | 0.0202602 | 0.0202602 | Olfr376/Olfr19/Olfr397/Olfr430 | 4 |
| GO:0009593 | GO:0009593 | detection of chemical stimulus | 4/76 | 48/8158 | 0.0009923 | 0.0426686 | 0.0426686 | Olfr376/Olfr19/Olfr397/Olfr430 | 4 |
| GO:0050906 | GO:0050906 | detection of stimulus involved in sensory perception | 4/76 | 54/8158 | 0.0015463 | 0.0581777 | 0.0581777 | Olfr376/Olfr19/Olfr397/Olfr430 | 4 |
Since there are some redundancy in the GO terms, we can simply them.
sim_go <- simplify(go, cutoff = 0.7, by = "p.adjust", select_fun = min, measure = "Rel")
kable(sim_go@result)
| ID | Description | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | geneID | Count | |
|---|---|---|---|---|---|---|---|---|---|
| GO:0007608 | GO:0007608 | sensory perception of smell | 29/76 | 160/8158 | 0 | 0 | 0 | Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 | 29 |
| GO:0007186 | GO:0007186 | G-protein coupled receptor signaling pathway | 29/76 | 272/8158 | 0 | 0 | 0 | Olfr745/Olfr376/Olfr19/Olfr749/Olfr30/Olfr937/Olfr136/Olfr923/Olfr1104/Olfr1502/Olfr1047/Olfr147/Olfr744/Olfr875/Olfr247/Olfr895/Olfr1504/Olfr1100/Olfr1463/Olfr874/Olfr983/Olfr876/Olfr1448/Olfr1377/Olfr397/Olfr491/Olfr1045/Olfr430/Olfr1054 | 29 |
Sometimes we want to do exploratory analysis by clustering samples based on RNA expression profiles and see if the clustering makes biological sense. Let’s see a few examples.
Let’s cluster the samples based on OR expression. Well, they actually cluster by litter… So it’s important to account for litter effect as we did before!
orCountData <- countData[grepl("^Olfr", rownames(countData)) & rowSums(countData) > 5, ]
superheat::superheat(t(log1p(orCountData)),
pretty.order.rows = TRUE,
pretty.order.cols = TRUE,
left.label.size = 0.3,
left.label.text.size = 3,
left.label.text.alignment = "right",
left.label.col = "white",
# scale the matrix columns
scale = TRUE,
row.dendrogram = TRUE,
col.dendrogram = TRUE,
legend=FALSE)
Just to showcase some PCA visualization.
or_pca <- data.frame(prcomp(t(log1p(orCountData)), scale.=TRUE)$x)
plot_ly(or_pca, x = ~PC1, y = ~PC2, z= ~PC3, text = rownames(or_pca)) %>%
add_markers() %>%
add_text() %>%
layout(showlegend = FALSE)
Keep note of what packages were used and their versions for future reference.
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.1 (2016-06-21)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Los_Angeles
## date 2017-04-23
## Packages ------------------------------------------------------------------
## package * version date source
## annotate 1.50.0 2016-05-04 Bioconductor
## AnnotationDbi * 1.34.4 2016-07-08 Bioconductor
## assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
## backports 1.0.4 2016-10-24 CRAN (R 3.3.0)
## base64enc 0.1-3 2015-07-28 CRAN (R 3.3.1)
## Biobase * 2.32.0 2016-05-04 Bioconductor
## BiocGenerics * 0.18.0 2016-05-04 Bioconductor
## clusterProfiler * 3.0.5 2016-08-19 Bioconductor
## colorspace 1.2-6 2015-03-11 CRAN (R 3.3.0)
## DBI 0.4-1 2016-05-08 CRAN (R 3.3.0)
## devtools 1.12.0 2016-06-24 CRAN (R 3.3.0)
## digest 0.6.12 2017-01-27 cran (@0.6.12)
## DO.db 2.9 2017-04-22 Bioconductor
## DOSE * 2.10.7 2016-07-21 Bioconductor
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0)
## edgeR * 3.14.0 2016-05-04 Bioconductor
## evaluate 0.10 2016-10-11 CRAN (R 3.3.1)
## ggdendro 0.1-20 2016-04-27 CRAN (R 3.3.0)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
## ggrepel * 0.6.5 2016-11-24 CRAN (R 3.3.2)
## GO.db 3.3.0 2017-04-22 Bioconductor
## GOSemSim 1.30.3 2016-07-20 Bioconductor
## graph 1.50.0 2016-05-04 Bioconductor
## gridExtra * 2.2.1 2016-02-29 CRAN (R 3.3.0)
## GSEABase 1.34.1 2016-09-22 Bioconductor
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## highr 0.6 2016-05-09 CRAN (R 3.3.0)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## htmlwidgets 0.8 2016-11-09 CRAN (R 3.3.2)
## httr 1.2.1 2016-07-03 CRAN (R 3.3.0)
## igraph 1.0.1 2015-06-26 CRAN (R 3.3.0)
## IRanges * 2.6.1 2016-06-19 Bioconductor
## jsonlite 1.0 2016-07-01 CRAN (R 3.3.1)
## knitr * 1.15.1 2016-11-22 CRAN (R 3.3.2)
## labeling 0.3 2014-08-23 CRAN (R 3.3.0)
## lattice 0.20-33 2015-07-14 CRAN (R 3.3.1)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
## limma * 3.28.21 2016-09-05 Bioconductor
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## MASS 7.3-45 2016-04-21 CRAN (R 3.3.1)
## matrixStats 0.50.2 2016-04-24 CRAN (R 3.3.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## org.Mm.eg.db * 3.3.0 2017-04-22 Bioconductor
## plotly * 4.5.6 2016-11-12 CRAN (R 3.3.2)
## plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
## purrr 0.2.2 2016-06-18 CRAN (R 3.3.0)
## qvalue 2.4.2 2016-05-16 Bioconductor
## R6 2.1.2 2016-01-26 CRAN (R 3.3.0)
## Rcpp 0.12.8 2016-11-17 cran (@0.12.8)
## reshape2 1.4.2 2016-10-22 cran (@1.4.2)
## rmarkdown 1.3 2016-12-21 CRAN (R 3.3.1)
## rprojroot 1.1 2016-10-29 CRAN (R 3.3.0)
## RSQLite 1.0.0 2014-10-25 CRAN (R 3.3.0)
## S4Vectors * 0.10.2 2016-07-08 Bioconductor
## scales 0.4.1 2016-11-09 cran (@0.4.1)
## SparseM 1.74 2016-11-10 cran (@1.74)
## stringi 1.1.2 2016-10-01 cran (@1.1.2)
## stringr 1.1.0 2016-08-19 CRAN (R 3.3.0)
## superheat * 0.1.0 2017-03-07 Github (rlbarter/superheat@88ed789)
## tibble 1.2 2016-08-26 cran (@1.2)
## tidyr * 0.6.0 2016-08-12 CRAN (R 3.3.0)
## topGO 2.24.0 2016-05-04 Bioconductor
## viridisLite 0.1.3 2016-03-12 CRAN (R 3.3.0)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
## XML 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
## xtable 1.8-2 2016-02-05 CRAN (R 3.3.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.3.0)